HealthCheck log output options #23900

Honny1 · 2024-09-09T13:24:19Z

This PR creates three new flags that can affect the output of the HealtCheck log.

Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via podman inspect. It is also limited to the last five executions and the first 500 characters per execution.

This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic healthcheck service failed record.

The --health-log-destination flag sets the destination of the HealthCheck log.
- none: (default behavior) HealthCheckResults are stored in overlay containers. (For example: ./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log)
- directory: creates a log file named <container-ID>-healthcheck.log with JSON HealthCheckResults in the specified directory.
- events_logger: The log will be written with logging mechanism set by events_logger.
The --health-max-log-count flag sets the maximum number of attempts in the HealthCheck log file.
- A value of 0 indicates an infinite number of attempts in the log file.
- The default value is 5 attempts in the log file.
The --health-max-log-size flag sets the maximum length of the log stored.
- A value of 0 indicates an infinite log length.
- The default value is 500 log characters.

Does this PR introduce a user-facing change?

Added --health-log-destination, --health-max-log-count and --health-max-log-size flags that affect HealtCheck log output.

Fixes: RHEL-24623

Honny1 · 2024-09-24T11:22:56Z

@mheon @Luap99 PTAL, I have resolved the feedback and checked that the defaults are propagated correctly.

Luap99 · 2024-09-24T12:02:05Z

libpod/events/journal_linux.go

+		if ee.HealthLog != "" {
+			m["PODMAN_HEALTH_LOG"] = ee.HealthLog
+		}
+		m["PODMAN_HEALTH_FAILING_STREAK"] = strconv.Itoa(ee.HealthFailingStreak)


I think we should only write this when we actually write a hc event.
Right now all events would get PODMAN_HEALTH_FAILING_STREAK set to 0 which doesn't make sense, I guess the same issue applies to PODMAN_HEALTH_STATUS

Luap99 · 2024-09-24T12:06:00Z

libpod/define/container_inspect.go

+	// HealthLogDestination defines the destination where the log is stored
+	HealthLogDestination string `json:"HealthLogDestination,omitempty" default:"local"`
+	// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
+	// ('0' value means an infinite number of attempts in the log file)
+	HealthMaxLogCount uint `json:"HealthcheckMaxLogCount,omitempty" default:"5"`
+	// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
+	// ("0" value means an infinite log length)
+	HealthMaxLogSize uint `json:"HealthcheckMaxLogSize,omitempty" default:"500"`


What uses the default key? AFAICT we never used this syntax anywhere?

I thought it would force a default value. For example, instead of 0 to 5. But it's not. I'll remove it.

Luap99 · 2024-09-24T12:06:31Z

libpod/container_config.go

+	// HealthLogDestination defines the destination where the log is stored
+	HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"`
+	// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
+	// ('0' value means an infinite number of attempts in the log file)
+	HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"`
+	// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
+	// ("0" value means an infinite log length)
+	HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"`


same here default key, this is a duplication if we ever change the default

Luap99 · 2024-09-24T12:09:31Z

pkg/specgen/generate/container_create.go

+	} else {
+		options = append(options, libpod.WithHealthCheckLogDestination(define.DefaultHealthCheckLocalDestination))
+		options = append(options, libpod.WithHealthCheckMaxLogCount(define.DefaultHealthMaxLogCount))
+		options = append(options, libpod.WithHealthCheckMaxLogSize(define.DefaultHealthMaxLogSize))
+	}
+


This is a bit confusing, I think it is better when defaults are set in libpod on not by specgen.

Luap99 · 2024-09-24T12:09:42Z

pkg/specgen/specgen.go

+	// HealthLogDestination defines the destination where the log is stored
+	HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"`
+	// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
+	// ('0' value means an infinite number of attempts in the log file)
+	HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"`
+	// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
+	// ("0" value means an infinite log length)
+	HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"`


same here with the default duplciation.

Luap99 · 2024-09-24T12:16:07Z

test/system/220-healthcheck.bats

+    count=$(grep -o "$msg" <<< "$output" | wc -l)
+    assert "$count" -eq 5


I don't think this produces nice errors either but I leave to @edsantiago to judge.

Luap99 · 2024-09-24T12:17:58Z

test/system/220-healthcheck.bats

@@ -273,4 +273,215 @@ Log[-1].Output   | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\"
    done
 }

+@test "podman healthcheck --health-max-log-count default value (5)" {


It seem there is rather significant duplication in what is being tested in e2e and system tests. We should not have duplicate tests as this just wastes CI time. One of them is good enough. I think using system tests make sense here as they run in gating.

And it is extremely difficult for me to figure out if these tests are indeed the same or if there are some minor differences.

The tests test the same functionality. I will remove the e2e tests.

Luap99 · 2024-09-24T12:18:54Z

test/system/220-healthcheck.bats

+        is "$output" "" "output from 'podman healthcheck run'"
+    done
+
+    # sys podman tests executes healthcheck twice the first time healthcheck is run


This is not correct, you are running them twice... The healthcheck is trigger by podman when the container is started but because you manually run_podman healthcheck run you run it twice not podman.

edsantiago

Way too much duplication in the tests. Can you find a way to refactor? Also a few other suggestions for corrections.

I'll take another look once you deduplicate.

edsantiago · 2024-09-24T12:30:22Z

test/system/220-healthcheck.bats

@@ -273,4 +273,215 @@ Log[-1].Output   | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\"
    done
 }

+@test "podman healthcheck --health-max-log-count default value (5)" {
+    local repeat_count=10
+    local msg="Hello, How are you?"


It's good practice to use randomized strings:

local msg="healthmsg-$(random_string)"

This prevents false negatives from caching, or copypasta, or various other types of logic errors. The "healthmsg" prefix will make it very easy for a future maintainer, looking at test failures, to see what type of string this is and to grep the code for it.

edsantiago · 2024-09-24T12:34:28Z

test/system/220-healthcheck.bats

+    cid="$output"
+
+    run_podman inspect $ctrname --format "{{.Config.HealthMaxLogCount}}"
+    is "$output" "5" "HealthMaxLogCount is set to 5"


Please drop the "is set to 5": on failure, the diagnostic message will already say "expected 5, got (something else)". And if the expected value ever changes, someone might change the "5" but forget to change the "is set to 5", and the code will be impossible to understand.

Also, where does this magic 5 come from? Is it a hardcoded default in podman? Is it a containers.conf setting? (I could read the PR but this is way too huge right now). What will happen if/when that default changes or is overridden? Maybe the message should say "HealthMaxLogCount is the expected default", or some explanation of where it's coming from

edsantiago · 2024-09-24T12:38:16Z

test/system/220-healthcheck.bats

+    for _ in $(seq 1 $repeat_count);
+    do
+        run_podman healthcheck run $ctrname
+        is "$output" "" "output from 'podman healthcheck run'"


Suggestion: "unexpected output from podman healthcheck run (pass $i)" (and use i instead of _). Any time there's a loop, it's nice to include the variables in the error message. Not super helpful in this particular case, but it's a good habit to get into.

edsantiago · 2024-09-24T12:40:35Z

test/system/220-healthcheck.bats

+
+
+    run_podman inspect $ctrname --format "{{.State.Health.Log}}"
+    count=$(grep -o "$msg" <<< "$output" | wc -l)


Why not grep -c? As in, count=$(grep -co "$msg" <<<"$output")

edsantiago · 2024-09-24T12:41:35Z

test/system/220-healthcheck.bats

+
+    run_podman inspect $ctrname --format "{{.State.Health.Log}}"
+    count=$(grep -o "$msg" <<< "$output" | wc -l)
+    assert "$count" -eq 5


Please explain the assertion, e.g. assert "$count" -eq 5 "Number of matching health log lines"

Honny1 · 2024-09-24T19:05:06Z

@edsantiago @Luap99 PTAL, I've modified the code according to your feedback.

edsantiago · 2024-09-24T19:15:14Z

Looking, but, test flakes rootless on my laptop:

✗ |220| podman healthcheck --health-max-log-size infinite value (0) [3485]
...
...very very very very long string
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Number of matching health log messages
#| expected: -eq 2
#|   actual:     1
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Honny1 · 2024-09-24T19:30:11Z

This is caused by the fact that podman run creates and starts the systemd timer right away, which starts the first run of HealtCheck when the container is created. Then the podman healthcheck run is manually triggered again in the test. That's why you can get a second run. However, this depends on systemd. I can avoid this by using the -ge comparison. WDYT? @edsantiago

edsantiago

Nice refactoring. A few suggestions and requests inline. As for the flake, I have no opinion yet, don't understand the code well enough to comment.

One more suggestion: could you please add comments in important places? This all makes perfect sense to you, because you're deeply immersed in the code today, but it is very hard for me to read, and one day will be very hard for you too. Simple comments (some suggestions below) can save a maintainer precious time.

Thank you!

edsantiago · 2024-09-24T20:42:03Z

test/system/220-healthcheck.bats

+    local ctrname="c-h-$(safename)"
+    _create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "--health-max-log-count  $repeat_count" "$repeat_count" "HealthMaxLogCount"
+
+    # This is run 11 times to check that the cap is working.


Suggestion: is run one more time than max-count. And instead of 0 $repeat_count, which is too subtle for a quick glance, $(seq 1 $repeat_count) plusone

edsantiago · 2024-09-24T20:44:30Z

test/system/220-healthcheck.bats

+    run_podman healthcheck run $ctrname
+    is "$output" "" "output from 'podman healthcheck run'"
+
+    _check_health_log $ctrname "healthmsg-}]$" 1


Suggestion: substr=${msg:0:10} then _check ... "$substr}]\$". With an explanation that max-log-size truncates the echo message, because there are just way too many options and way too many magic "10"s all over, and it's confusing...

edsantiago · 2024-09-24T20:44:57Z

test/system/220-healthcheck.bats

+
+
+@test "podman healthcheck --health-log-destination file" {
+    local TMP_DIR_HEALTHCHECK="$BATS_FILE_TMPDIR/healthcheck"


Please use $PODMAN_TMPDIR. That's unique to each test.

edsantiago · 2024-09-24T20:45:48Z

test/system/220-healthcheck.bats

+    run cat $healthcheck_log_path
+    count=$(grep -co "$msg" <<< "$output")


Suggestion: the run cat is unnecessary, you can just count=$(grep -co "$msg" $healthcheck_log_path)

edsantiago · 2024-09-24T20:50:26Z

test/system/220-healthcheck.bats

+
+    # The healthcheck is triggered by the podman when the container is started, but its execution depends on systemd.
+    # And since `run_podman healthcheck run` is also run manually, it will result in two runs.
+    run journalctl --output cat --output-fields=PODMAN_HEALTH_LOG PODMAN_ID=$cid


Bats run is evil. Or at least fiendishly tricky. It hides a lot of important stuff. Suggestion:

cmd="journalctl --output cat ...etc" echo "$_LOG_PROMPT $cmd" run $cmd echo "$output" assert "$status" -eq 0 "exit status of journalctl"

This is tedious and it's very unfortunate that we have to do things like this... but a future maintainer will greatly appreciate it when the test fails.

edsantiago · 2024-09-24T20:53:10Z

test/system/220-healthcheck.bats

+}
+
+@test "podman healthcheck --health-max-log-count default value (5)" {
+    local repeat_count=10


Suggestion: eliminate

edsantiago · 2024-09-24T20:54:05Z

test/system/220-healthcheck.bats

+    local ctrname="c-h-$(safename)"
+    _create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "" "5" "HealthMaxLogCount is the expected default"
+
+    for i in $(seq 1 $repeat_count);


Just hardcode 10, and add a comment above, something like "run many more times than max count; confirm that excess ones are not logged"

…nation flags These flags can affect the output of the HealtCheck log. Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via `podman inspect`. It is also limited to the last five executions and the first 500 characters per execution. This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic `healthcheck service failed` record. - The `--health-log-destination` flag sets the destination of the HealthCheck log. - `none`: (default behavior) `HealthCheckResults` are stored in overlay containers. (For example: `$runroot/healthcheck.log`) - `directory`: creates a log file named `<container-ID>-healthcheck.log` with JSON `HealthCheckResults` in the specified directory. - `events_logger`: The log will be written with logging mechanism set by events_loggeri. It also saves the log to a default directory, for performance on a system with a large number of logs. - The `--health-max-log-count` flag sets the maximum number of attempts in the HealthCheck log file. - A value of `0` indicates an infinite number of attempts in the log file. - The default value is `5` attempts in the log file. - The `--health-max-log-size` flag sets the maximum length of the log stored. - A value of `0` indicates an infinite log length. - The default value is `500` log characters. Add --health-max-log-count flag Signed-off-by: Jan Rodák <[email protected]> Add --health-max-log-size flag Signed-off-by: Jan Rodák <[email protected]> Add --health-log-destination flag Signed-off-by: Jan Rodák <[email protected]>

Honny1 · 2024-09-25T12:12:06Z

@edsantiago PTAL, I've modified the tests according to your suggestions.

packit-as-a-service · 2024-09-25T12:12:52Z

Ephemeral COPR build failed. @containers/packit-build please check.

Luap99

/lgtm

openshift-ci · 2024-09-26T11:53:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Honny1, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99,edsantiago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Sep 9, 2024

Honny1 force-pushed the healthcheck-log branch 11 times, most recently from 331ce42 to ebadb8b Compare September 10, 2024 15:58

containers deleted a comment from packit-as-a-service bot Sep 10, 2024

Honny1 force-pushed the healthcheck-log branch 16 times, most recently from bf7e1cb to 11b35a5 Compare September 13, 2024 12:10

Honny1 force-pushed the healthcheck-log branch 2 times, most recently from 244c8a9 to eb80e95 Compare September 24, 2024 09:51

containers deleted a comment from packit-as-a-service bot Sep 24, 2024

Honny1 force-pushed the healthcheck-log branch from eb80e95 to 9d1cc56 Compare September 24, 2024 10:58

Luap99 reviewed Sep 24, 2024

View reviewed changes

edsantiago requested changes Sep 24, 2024

View reviewed changes

Honny1 force-pushed the healthcheck-log branch 3 times, most recently from 909bc6d to 49f036d Compare September 24, 2024 19:03

Honny1 requested a review from edsantiago September 24, 2024 19:03

edsantiago requested changes Sep 24, 2024

View reviewed changes

Honny1 force-pushed the healthcheck-log branch from 49f036d to 465b01d Compare September 25, 2024 09:11

Honny1 force-pushed the healthcheck-log branch from 465b01d to de856da Compare September 25, 2024 12:01

containers deleted a comment from packit-as-a-service bot Sep 25, 2024

edsantiago approved these changes Sep 25, 2024

View reviewed changes

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2024

containers deleted a comment from packit-as-a-service bot Sep 25, 2024

Honny1 requested a review from Luap99 September 25, 2024 15:28

Luap99 approved these changes Sep 26, 2024

View reviewed changes

openshift-ci bot assigned Luap99 Sep 26, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2024

openshift-merge-bot bot merged commit 4e38381 into containers:main Sep 26, 2024
74 of 80 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HealthCheck log output options #23900

HealthCheck log output options #23900

Honny1 commented Sep 9, 2024 •

edited

Loading

Honny1 commented Sep 24, 2024

Luap99 Sep 24, 2024

Luap99 Sep 24, 2024

Honny1 Sep 24, 2024

Luap99 Sep 24, 2024

Luap99 Sep 24, 2024

Luap99 Sep 24, 2024

Luap99 Sep 24, 2024

Luap99 Sep 24, 2024

Honny1 Sep 24, 2024

Luap99 Sep 24, 2024

edsantiago left a comment

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

Honny1 commented Sep 24, 2024

edsantiago commented Sep 24, 2024

Honny1 commented Sep 24, 2024

edsantiago left a comment

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

edsantiago Sep 24, 2024

Honny1 commented Sep 25, 2024

packit-as-a-service bot commented Sep 25, 2024

Luap99 left a comment

openshift-ci bot commented Sep 26, 2024

		count=$(grep -o "$msg" <<< "$output" \| wc -l)
		assert "$count" -eq 5



		run_podman inspect $ctrname --format "{{.State.Health.Log}}"
		count=$(grep -o "$msg" <<< "$output" \| wc -l)



		@test "podman healthcheck --health-log-destination file" {
		local TMP_DIR_HEALTHCHECK="$BATS_FILE_TMPDIR/healthcheck"

		run cat $healthcheck_log_path
		count=$(grep -co "$msg" <<< "$output")

HealthCheck log output options #23900

HealthCheck log output options #23900

Conversation

Honny1 commented Sep 9, 2024 • edited Loading

Does this PR introduce a user-facing change?

Honny1 commented Sep 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

edsantiago left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Honny1 commented Sep 24, 2024

edsantiago commented Sep 24, 2024

Honny1 commented Sep 24, 2024

edsantiago left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Honny1 commented Sep 25, 2024

packit-as-a-service bot commented Sep 25, 2024

Luap99 left a comment

Choose a reason for hiding this comment

openshift-ci bot commented Sep 26, 2024

Honny1 commented Sep 9, 2024 •

edited

Loading